Policies produced by deep reinforcement learning are typically characterised by their learning curves, but they remain poorly understood in many other respects. ReLU-based policies partition the input space into piecewise linear regions. We seek to understand how observed region counts and their densities evolve during deep reinforcement learning, using empirical results that span a range of continuous control tasks and policy network dimensions. Intuitively, we may expect that during training the region density increases in the areas that are frequently visited by the policy, thereby affording fine-grained control. We use recent theoretical and empirical results on the linear regions induced by neural networks in supervised learning settings to ground and compare our findings. Empirically, we find that the region density increases only moderately throughout training, as measured along fixed trajectories coming from the final policy. However, the trajectories themselves also increase in length during training, and thus the region densities decrease as seen from the perspective of the current trajectory. Our findings suggest that the complexity of deep reinforcement learning policies does not principally emerge from a significant growth in the complexity of functions observed on-and-around trajectories of the policy.
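The region counts discussed above are typically measured empirically as the number of distinct ReLU activation patterns encountered along a path through the input space. The sketch below illustrates that proxy on a toy two-hidden-layer policy; the layer sizes, observation dimension, and random "trajectory" are hypothetical placeholders, not the setup used in the paper.

```python
# Minimal sketch: count linear regions crossed along a trajectory by counting
# distinct ReLU activation patterns on densely sampled points between states.
import numpy as np

rng = np.random.default_rng(0)

# Toy two-hidden-layer ReLU policy: obs_dim (17) -> 64 -> 64
W1, b1 = rng.standard_normal((64, 17)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((64, 64)), rng.standard_normal(64)

def activation_pattern(obs):
    """Binary on/off pattern of every ReLU unit for one observation."""
    h1 = W1 @ obs + b1
    h2 = W2 @ np.maximum(h1, 0.0) + b2
    return tuple((np.concatenate([h1, h2]) > 0).astype(np.int8))

# Stand-in for a trajectory of environment states visited by the policy.
trajectory = rng.standard_normal((200, 17))
# Densely sample each transition and count how many distinct patterns
# (i.e. linear regions) the path crosses.
samples = np.concatenate(
    [np.linspace(a, b, 50) for a, b in zip(trajectory[:-1], trajectory[1:])]
)
regions = {activation_pattern(s) for s in samples}
print(f"regions crossed: {len(regions)}, "
      f"density: {len(regions) / len(trajectory):.2f} per state")
```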
In this paper, we propose DiffFace, the first diffusion-based face swapping framework, composed of ID-conditional DDPM training, sampling with facial guidance, and a target-preserving blending strategy. Specifically, during training, the ID-conditional DDPM learns to generate face images with the desired identity. During sampling, we use off-the-shelf facial expert models to make the model transfer the source identity while faithfully preserving the target attributes. In this process, to preserve the background of the target image and obtain the desired face swapping result, we additionally propose a target-preserving blending strategy, which keeps the attributes of the target face uncorrupted by noise while the source facial identity is transferred. In addition, without any re-training, our model can flexibly apply additional facial guidance and adaptively control the identity-attribute trade-off to achieve the desired results. To the best of our knowledge, this is the first approach that applies the diffusion model to the face swapping task. Compared with previous GAN-based approaches, DiffFace benefits from the diffusion model in terms of training stability, high fidelity, sample diversity, and controllability. Extensive experiments show that DiffFace is comparable or superior to state-of-the-art methods on several standard face swapping benchmarks.
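A key ingredient above is the target-preserving blending. The following is a minimal, hypothetical sketch of what such a step could look like in a reverse-diffusion loop: outside a facial mask, the intermediate sample is replaced by a correspondingly noised copy of the target image, so the background and target attributes survive denoising. The function name, mask convention, and schedule handling are assumptions for illustration, not the authors' implementation.

```python
import torch

def target_preserving_blend(x_t, target_img, face_mask, alpha_bar_t):
    """Blend the current diffusion sample with a noised target outside the face mask.

    x_t:         current sample at timestep t, shape (B, 3, H, W)
    target_img:  clean target image x_0, same shape
    face_mask:   1 inside the facial region to be swapped, 0 elsewhere, (B, 1, H, W)
    alpha_bar_t: cumulative noise-schedule value at t (scalar tensor)
    """
    noise = torch.randn_like(target_img)
    # Noise the target to the same timestep as x_t (standard DDPM forward process).
    noised_target = alpha_bar_t.sqrt() * target_img + (1 - alpha_bar_t).sqrt() * noise
    # Keep the model's sample inside the face, the noised target everywhere else.
    return face_mask * x_t + (1 - face_mask) * noised_target

# Schematic placement inside a reverse-diffusion loop (names hypothetical):
# for t in reversed(range(T)):
#     x_t = denoise_step(x_t, t, id_condition)              # ID-conditional DDPM step
#     x_t = x_t + guidance_scale * facial_expert_grad(x_t)  # off-the-shelf facial guidance
#     x_t = target_preserving_blend(x_t, target, mask, alpha_bar[t])
```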
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%), and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based, and of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once; this was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Video-grounded Dialogue (VGD) aims to decode an answer sentence to a question regarding a given video and dialogue context. Despite the recent success of multi-modal reasoning in generating answer sentences, existing dialogue systems still suffer from a text hallucination problem: indiscriminate copying of text from the inputs without an understanding of the question. This stems from learning spurious correlations, since answer sentences in the dataset usually include words from the input texts; the VGD system therefore relies excessively on copying words from the input texts in the hope that they overlap with the ground-truth answer. Hence, we design the Text Hallucination Mitigating (THAM) framework, which incorporates a Text Hallucination Regularization (THR) loss derived from the proposed information-theoretic text hallucination measurement approach. Applying THAM to current dialogue systems demonstrates its effectiveness on VGD benchmarks (i.e., AVSD@DSTC7 and AVSD@DSTC8) and yields enhanced interpretability.
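The THR loss itself is defined via an information-theoretic hallucination measurement; as a loose, hypothetical proxy for how such a regularizer might be attached to a dialogue decoder, the sketch below penalizes the probability mass that the decoder places on tokens copied from the input text, alongside the usual cross-entropy objective. All names and the form of the penalty are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def hallucination_regularized_loss(logits, targets, input_token_ids, vocab_size, lam=0.1):
    """logits: (B, L, V) decoder outputs; targets: (B, L) gold answer tokens;
    input_token_ids: (B, L_in) token ids of the input dialogue text."""
    ce = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

    # Vocabulary mask of tokens that occur in the input text (copy candidates).
    copy_mask = torch.zeros(logits.size(0), vocab_size, device=logits.device)
    copy_mask.scatter_(1, input_token_ids, 1.0)

    probs = logits.softmax(dim=-1)                        # (B, L, V)
    copy_mass = (probs * copy_mask.unsqueeze(1)).sum(-1)  # copying probability per step
    return ce + lam * copy_mass.mean()                    # discourage indiscriminate copying
```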
Although recent gaze estimation methods put great emphasis on extracting gaze-relevant features from face or eye images, how to define features that contain only gaze-relevant components remains ambiguous. This ambiguity makes the model learn not only gaze-relevant features but also irrelevant ones, which is particularly fatal for cross-dataset performance. To overcome this challenging problem, we propose a data-driven approach that exploits the disentanglement property of generative adversarial network (GAN) inversion to selectively utilize the gaze-relevant features in a latent code. In addition, by leveraging the GAN-based encoder-generator process, we shift the input image from the target domain to the source domain, with which the gaze estimator is sufficiently familiar. Furthermore, we propose a gaze distortion loss in the encoder to prevent the gaze information from being distorted. Experimental results demonstrate that our method achieves state-of-the-art gaze estimation accuracy on cross-domain gaze estimation tasks. The code is available at https://github.com/leeisack/latentgaze/.
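To make the latent-code selection idea concrete, here is a schematic sketch, under stated assumptions, of a gaze head that soft-selects gaze-relevant dimensions of a GAN-inversion latent code before regression. The module names, latent dimensionality, and gating mechanism are illustrative, not the released LatentGaze code.

```python
import torch
import torch.nn as nn

class LatentGazeHead(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        # Learned soft selector over latent dimensions (sigmoid-gated).
        self.selector = nn.Parameter(torch.zeros(latent_dim))
        self.regressor = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 2)  # yaw, pitch
        )

    def forward(self, latent_code):
        gaze_relevant = latent_code * torch.sigmoid(self.selector)
        return self.regressor(gaze_relevant)

# Schematic usage with a pretrained GAN-inversion encoder/generator (hypothetical names):
# w = encoder(target_domain_image)   # invert the face image into the latent space
# source_like = generator(w)         # re-synthesize toward the source domain the estimator knows
# gaze = LatentGazeHead()(w)
```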
Although gaze estimation methods have been developed with deep learning techniques, there has been no approach that achieves accurate performance on low-resolution face images with a pixel width of 50 pixels or less. To overcome this limitation under challenging low-resolution conditions, we propose a high-frequency attentive super-resolved gaze estimation network, HAZE-Net. Our network improves the resolution of the input image and enhances the eye features and their boundaries through a super-resolution module based on the proposed high-frequency attention block. In addition, our gaze estimation module exploits the high-frequency components of the eyes along with a global appearance map. We also utilize the structural location information of the face to approximate the head pose. Experimental results show that the proposed method exhibits robust gaze estimation performance even on low-resolution face images of 28x28 pixels. The source code of this work is available at https://github.com/dbseorms16/haze_net/.
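The following is a minimal sketch, under stated assumptions, of what a high-frequency attention block of the kind described above could look like: the high-frequency residual (input minus a low-pass copy) is turned into a spatial attention map that re-weights the features around eye boundaries. Kernel sizes and layout are illustrative, not the HAZE-Net specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFrequencyAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid()
        )

    def forward(self, feat):
        low = F.avg_pool2d(feat, kernel_size=3, stride=1, padding=1)  # low-pass copy
        high = feat - low                                             # high-frequency residual
        return feat + feat * self.attn(high)                          # attend to edges/boundaries

x = torch.randn(1, 32, 28, 28)  # e.g. features of a 28x28 low-resolution face crop
print(HighFrequencyAttention(32)(x).shape)
```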
This paper presents a novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation. The use of transformers can benefit correlation-map aggregation through self-attention over a global receptive field. However, the tokenization of a correlation map for transformer processing can be harmful, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias. To address the problem, we propose a 4D convolutional Swin Transformer, in which a high-dimensional Swin Transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce a convolutional inductive bias. We additionally boost aggregation performance by applying transformers within a pyramidal structure, where aggregation at a coarser level guides aggregation at a finer level. Noise in the transformer output is then filtered in the subsequent decoder with the help of the query's appearance embedding. With this model, a new state of the art is set for all standard benchmarks in the few-shot segmentation setting. It is also shown that VAT attains state-of-the-art performance for semantic correspondence, where cost aggregation likewise plays a central role.
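The core architectural point, small-kernel convolutions before transformer tokenization so that every position carries local context and a convolutional inductive bias, can be sketched in simplified form as below. For brevity, the 4D correlation volume is collapsed to a 2D map per query position; this is an assumption-laden illustration, not the VAT architecture.

```python
import torch
import torch.nn as nn

class ConvThenTransformer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.local = nn.Sequential(                       # small-kernel convs: local context
            nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.global_agg = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, corr):                              # corr: (B, 1, Hs, Ws) correlation slice
        feat = self.local(corr)                           # convolutional inductive bias first
        tokens = feat.flatten(2).transpose(1, 2)          # (B, Hs*Ws, dim) tokens
        return self.global_agg(tokens)                    # global self-attention aggregation

out = ConvThenTransformer()(torch.randn(2, 1, 16, 16))
print(out.shape)  # torch.Size([2, 256, 64])
```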
In multi-view 3D object detection tasks, disparity supervision over overlapping image regions substantially improves overall detection performance. However, current multi-view 3D object detection methods often fail to detect objects in the overlap region properly, and the network's understanding of the scene is often limited to that of a monocular detection network. To alleviate this issue, we advocate applying traditional stereo disparity estimation to obtain reliable disparity information for the overlap region. Given the estimated disparity as supervision, we propose to regularize the network to fully exploit the geometric potential of binocular images and thereby improve the overall detection accuracy. Moreover, we propose to use an adversarial overlap-region discriminator, trained to minimize the representational gap between non-overlap and overlap regions, where objects are often largely occluded or suffer from deformation due to camera distortion, causing a domain shift. We demonstrate the effectiveness of the proposed method on the large-scale multi-view 3D object detection benchmark nuScenes. Our experiments show that the proposed method outperforms current state-of-the-art methods.
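As a minimal sketch of the disparity supervision described above (an assumed form, not the paper's code), the loss below compares the detector's predicted disparity with pseudo-ground-truth from a classical stereo matcher, restricted to the overlap mask between adjacent camera views.

```python
import torch
import torch.nn.functional as F

def overlap_disparity_loss(pred_disparity, stereo_disparity, overlap_mask):
    """pred_disparity, stereo_disparity: (B, 1, H, W); overlap_mask: (B, 1, H, W) in {0, 1}."""
    diff = F.smooth_l1_loss(pred_disparity, stereo_disparity, reduction="none")
    masked = diff * overlap_mask                     # supervise only where views overlap
    return masked.sum() / overlap_mask.sum().clamp(min=1.0)

# The adversarial component would analogously train a small discriminator to tell
# overlap-region features from non-overlap ones, with the detector trained to fool it,
# shrinking the representational gap between the two regions.
```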
Modern retrospective analytics systems leverage cascade architectures to mitigate the bottleneck of computing deep neural networks (DNNs). However, existing cascades suffer from two limitations: (1) the decoding bottleneck is either neglected or circumvented by paying significant compute and storage costs for pre-processing; and (2) the systems are specialized for temporal queries and lack support for spatial queries. This paper presents CoVA, a novel cascade architecture that splits the cascade computation between the compressed domain and the pixel domain to address the decoding bottleneck while supporting both temporal and spatial queries. CoVA cascades the analysis into three major stages, where the first two stages are performed in the compressed domain and the last one in the pixel domain. First, CoVA detects the occurrences of moving objects (called blobs) over a set of compressed frames and links them into tracks. Then, using the track results, CoVA prudently selects a minimal set of frames to obtain the label information and decodes only those frames to run the full DNNs, thereby alleviating the decoding bottleneck. Lastly, CoVA associates tracks with labels to produce the final analysis results, on which users can process both temporal and spatial queries. Our experiments demonstrate that CoVA offers a 4.8x throughput improvement over modern cascade systems while imposing only a modest accuracy loss.
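The frame-selection stage can be illustrated with a simple greedy cover (not CoVA's actual implementation): given blob tracks found in the compressed domain, pick a small set of frames such that every track appears in at least one frame, and decode only those frames for the full DNN.

```python
from collections import defaultdict

def select_frames(tracks):
    """tracks: dict track_id -> set of frame indices where the blob appears."""
    uncovered = set(tracks)
    frame_to_tracks = defaultdict(set)
    for tid, frames in tracks.items():
        for f in frames:
            frame_to_tracks[f].add(tid)

    selected = []
    while uncovered:
        # Greedy set cover: take the frame covering the most still-uncovered tracks.
        best = max(frame_to_tracks, key=lambda f: len(frame_to_tracks[f] & uncovered))
        selected.append(best)
        uncovered -= frame_to_tracks[best]
    return sorted(selected)

tracks = {0: {3, 4, 5}, 1: {5, 6}, 2: {10, 11}}
print(select_frames(tracks))  # [5, 10]: only these frames are decoded and labeled
```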
Whole slide image (WSI) classification is a fundamental task for diagnosing and treating diseases; however, curating accurate labels is time-consuming and limits the application of fully supervised methods. To address this, multiple instance learning (MIL) is a popular approach that poses classification as a weakly supervised learning task using only slide-level labels. While current MIL methods apply variants of the attention mechanism to re-weight instance features with stronger models, scant attention is paid to the properties of the data distribution. In this work, we propose to re-calibrate the distribution of a WSI bag (of instances) using the statistics of the max-instance (critical) feature. We assume that in binary MIL, positive bags have larger feature magnitudes than negative ones, so we can enforce the model to maximize the discrepancy between bags with a metric feature loss that models positive bags as out-of-distribution. To achieve this, unlike existing MIL methods that use single-batch training modes, we propose balanced-batch sampling to use the feature loss effectively, i.e., sampling (+/-) bags simultaneously. In addition, we employ a position encoding module (PEM) to model spatial/morphological information, and perform pooling by multi-head self-attention (PSMA) with a Transformer encoder. Experimental results on existing benchmark datasets show that our approach is effective and improves over state-of-the-art MIL methods.
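The re-calibration idea can be sketched as follows, under assumptions rather than the paper's exact loss: within a balanced (+/-) batch, take the max-instance (critical) feature of each bag and push the positive bag's feature magnitude above the negative one by a margin, treating positives as out-of-distribution relative to negatives.

```python
import torch
import torch.nn.functional as F

def max_instance_feature_loss(pos_bag_feats, neg_bag_feats, margin=1.0):
    """pos_bag_feats / neg_bag_feats: (N_instances, D) instance features of a positive
    and a negative WSI bag drawn in the same balanced batch."""
    pos_max = pos_bag_feats.norm(dim=1).max()  # magnitude of the critical (max) instance
    neg_max = neg_bag_feats.norm(dim=1).max()
    # Hinge: positive critical magnitude should exceed the negative one by a margin.
    return F.relu(margin - (pos_max - neg_max))

pos = torch.randn(500, 256) * 2.0  # toy positive bag with larger-magnitude features
neg = torch.randn(300, 256)
print(max_instance_feature_loss(pos, neg))
```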